Skip to content

fix(analytics): thread-safe polling start + surface fetch errors (SDK-81, SDK-78)#155

Open
tylerjroach wants to merge 3 commits into
masterfrom
fix/sdk-85-thread-safe-polling-start
Open

fix(analytics): thread-safe polling start + surface fetch errors (SDK-81, SDK-78)#155
tylerjroach wants to merge 3 commits into
masterfrom
fix/sdk-85-thread-safe-polling-start

Conversation

@tylerjroach

@tylerjroach tylerjroach commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Bundles two small audit fixes in the polling loop — both local_flags_provider.rb, both touching the same begin/rescue block.

SDK-81 — thread-safe polling start

start_polling_for_definitions! had a check-then-act race: if @config[:enable_polling] && !@polling_thread ran unsynchronized, so two concurrent callers could both see @polling_thread == nil, both assign Thread.new { ... }, and the earlier thread would become orphaned but keep polling forever.

Add a @polling_mutex guarding the polling-thread lifecycle:

  • start_polling_for_definitions!: under the mutex, bail early if @polling_thread already exists, otherwise spawn the thread.
  • stop_polling_for_definitions!: snapshot @polling_thread and @stop_polling under the mutex, then join outside the mutex so a long-running join can't block a concurrent start! for a full poll interval.

SDK-78 — surface polling-loop errors

The loop caught StandardError and dispatched only to @error_handler, whose default Mixpanel::ErrorHandler#handle is a no-op. Schema drift (NoMethodError, JSON::ParserError, TypeError, …) looped forever undetected unless the user had configured a custom handler.

Add a log_polling_error helper that always warns to $stderr before dispatching to @error_handler. Visibility no longer depends on configuration. Matches the de facto convention across every other Mixpanel SDK polling loop (mixpanel-python logger.exception, mixpanel-java logger.log(WARNING), mixpanel-go log.Printf, mixpanel-node this.logger?.error) — all log unconditionally and keep polling. Crashing the polling thread on the first bad payload would freeze local evaluation at whatever variants were last cached — strictly worse than continuing.

Context

Linear: SDK-81, SDK-78. Same audit-driven cross-SDK push; for SDK-81, Ruby is one of three affected (also Java #91, Node #278). SDK-78 was Ruby-only — the other SDKs already log unconditionally.

Test plan

  • Full flags spec suite (47 examples) passes
  • "is idempotent under concurrent start calls and does not leak threads" (SDK-81) spins up 8 contender threads behind a Queue gate; asserts only 1 new background thread exists after they all return. Before the fix the count would be 8.
  • "warns to stderr when fetch raises and no error_handler is configured" (SDK-78) builds the provider with nil error_handler + 500 response, asserts the warn lands on stderr.
  • "surfaces unexpected errors (schema drift) instead of swallowing them silently" (SDK-78) injects NoMethodError via allow(...).to receive(:fetch_flag_definitions), asserts both the error_handler dispatch AND the stderr warning fire.

🤖 Generated with Claude Code

…utors

startPollingForDefinitions had no lock around the pollingExecutor
create-and-assign. Two concurrent callers would each allocate a fresh
ScheduledExecutorService and the earlier one's worker thread + queue
would leak (still alive, still scheduled to poll, no way to shut it
down because the field had been overwritten).

Guard the executor lifecycle with a private lock and bail early if a
poller is already scheduled. Snapshot the executor under the lock in
stop, then run the (potentially blocking) shutdown / awaitTermination
outside the lock so a long-running shutdown can't block a concurrent
start for 5s.

Linear: SDK-85

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@tylerjroach tylerjroach requested review from a team and tdumitrescu June 29, 2026 17:42
@linear-code

linear-code Bot commented Jun 29, 2026

Copy link
Copy Markdown

SDK-85

SDK-81

SDK-78

@codecov

codecov Bot commented Jun 29, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 93.33333% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 96.69%. Comparing base (e5edf71) to head (18d2131).

Files with missing lines Patch % Lines
lib/mixpanel-ruby/flags/local_flags_provider.rb 93.33% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #155      +/-   ##
==========================================
+ Coverage   96.64%   96.69%   +0.04%     
==========================================
  Files          14       14              
  Lines         656      665       +9     
==========================================
+ Hits          634      643       +9     
  Misses         22       22              
Flag Coverage Δ
openfeature 100.00% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@tylerjroach tylerjroach changed the title fix(flags): guard polling start so concurrent callers don't leak threads (SDK-85) fix(analytics): guard polling start so concurrent callers don't leak threads (SDK-85) Jun 29, 2026
@tylerjroach tylerjroach changed the title fix(analytics): guard polling start so concurrent callers don't leak threads (SDK-85) fix(analytics): guard polling start so concurrent callers don't leak threads (SDK-81) Jun 29, 2026
…p silently

start_polling_for_definitions! caught StandardError and dispatched only
to @error_handler (whose default ErrorHandler#handle is a no-op), so
NoMethodError / JSON::ParserError / etc. from schema drift would loop
forever undetected unless the user had configured a custom handler.

Add log_polling_error which always warns to STDERR before dispatching
to @error_handler — visibility no longer depends on configuration.
Matches the convention in mixpanel-python, mixpanel-java, mixpanel-go,
and mixpanel-node, all of which log unconditionally and continue
polling.

Linear: SDK-78

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@tylerjroach tylerjroach changed the title fix(analytics): guard polling start so concurrent callers don't leak threads (SDK-81) fix(analytics): thread-safe polling start + surface fetch errors (SDK-81, SDK-78) Jun 29, 2026
@tylerjroach

Copy link
Copy Markdown
Contributor Author

@greptileai

@greptile-apps

greptile-apps Bot commented Jul 2, 2026

Copy link
Copy Markdown

Confidence Score: 5/5

The changes are well-scoped and the core logic — per-thread stop flags, mutex-guarded lifecycle, join outside the mutex — correctly addresses both races. One minor test-reliability nuance exists but does not affect production behavior.

The per-thread stop signal correctly prevents the stop+start zombie-thread race. The mutex eliminates the duplicate-spawn race. Production code is sound. The only observation is a narrow scheduling window in the concurrent idempotency test where the poller thread's name may not yet be set when the assertion runs, but this is unlikely to manifest in practice and does not affect production correctness.

No files require special attention; the suggestion is in local_flags_provider.rb around thread-name assignment timing.

Important Files Changed

Filename Overview
lib/mixpanel-ruby/flags/local_flags_provider.rb Adds per-thread stop signals and a polling mutex to fix the check-then-act race (SDK-81), and adds log_polling_error for unconditional stderr visibility (SDK-78). Logic is sound; minor note on thread-name assignment timing.
spec/mixpanel-ruby/flags/local_flags_spec.rb Adds four new tests covering concurrent idempotency, stop+start correctness, stderr visibility without an error handler, and schema-drift error dispatch. The stop+start and SDK-78 tests are robust; the concurrent test filters by thread name but asserts immediately after join with no scheduling gap, which can be flaky if the thread hasn't set its name yet.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant C1 as Caller 1
    participant C2 as Caller 2
    participant M as @polling_mutex
    participant T as PollerThread

    C1->>C1: fetch_flag_definitions (initial)
    C2->>C2: fetch_flag_definitions (initial)
    C1->>M: "synchronize { @polling_thread nil? → spawn }"
    Note over M: lock held by C1
    C2->>M: "synchronize { wait... }"
    M-->>T: "Thread.new (stop_flag=[false])"
    C1->>M: release
    C2->>M: "acquire { @polling_thread exists → return }"
    Note over T: loop: sleep → check stop_flag[0] → fetch

    C1->>M: "stop: set stop_flag[0]=true, @polling_thread=nil"
    C1->>T: join (outside mutex)
    C2->>M: "start: @polling_thread nil → spawn new thread"
    Note over T: wakes, stop_flag[0]==true → break ✓
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant C1 as Caller 1
    participant C2 as Caller 2
    participant M as @polling_mutex
    participant T as PollerThread

    C1->>C1: fetch_flag_definitions (initial)
    C2->>C2: fetch_flag_definitions (initial)
    C1->>M: "synchronize { @polling_thread nil? → spawn }"
    Note over M: lock held by C1
    C2->>M: "synchronize { wait... }"
    M-->>T: "Thread.new (stop_flag=[false])"
    C1->>M: release
    C2->>M: "acquire { @polling_thread exists → return }"
    Note over T: loop: sleep → check stop_flag[0] → fetch

    C1->>M: "stop: set stop_flag[0]=true, @polling_thread=nil"
    C1->>T: join (outside mutex)
    C2->>M: "start: @polling_thread nil → spawn new thread"
    Note over T: wakes, stop_flag[0]==true → break ✓
Loading

Reviews (2): Last reviewed commit: "fix(flags): per-thread stop signal so st..." | Re-trigger Greptile

Comment thread lib/mixpanel-ruby/flags/local_flags_provider.rb
Comment thread spec/mixpanel-ruby/flags/local_flags_spec.rb Outdated
… poller

stop_polling_for_definitions! flipped a shared @stop_polling boolean.
A concurrent start_polling_for_definitions! could then reset it to
false before the old thread woke from its sleep, leaving the old
poller running indefinitely and orphaned (no longer tracked by
@polling_thread, so no future stop could join it).

Each polling thread now captures its own [false] cell by closure. A
subsequent start allocates a fresh cell, so the previous thread still
observes true and exits on its next wake.

Also name the polling thread ('mixpanel-flags-poller') so the concurrent-
start regression test can count our threads by name instead of diffing
Thread.list.size, which is fragile under other background threads in
the same process.
@tylerjroach

Copy link
Copy Markdown
Contributor Author

Pushed d8682c5 addressing both Greptile threads.

P1 — concurrent stop+start race (local_flags_provider.rb:65): replaced the shared @stop_polling boolean with a per-thread [false] cell captured by closure. Each polling thread checks its own cell, so a subsequent start_polling_for_definitions! allocating a fresh cell cannot clear the previous thread's stop signal. Also named the thread mixpanel-flags-poller for debuggability. Added a new spec stop+start does not leave the previous poller running that would fail on the old shared-flag design.

P2 — fragile Thread.list.size baseline (local_flags_spec.rb:790): the concurrent-start test now filters by t.name == 'mixpanel-flags-poller' instead of diffing Thread.list.size, so unrelated background threads (RSpec workers, GC finalizers) can't perturb the assertion.

All 48 specs pass locally.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant